{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "\n",
    "<a href='https://www.udemy.com/user/joseportilla/'><img src='../Pierian_Data_Logo.png'/></a>\n",
    "___\n",
    "<center><em>Content Copyright by Pierian Data</em></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Working with PDF Files\n",
    "\n",
    "Welcome back Agent. Often you will have to deal with PDF files. There are [many libraries in Python for working with PDFs](https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167), each with their pros and cons, the most common one being **pypdf**. You can install it with:\n",
    "\n",
    "    pip install pypdf\n",
    "    \n",
    "Keep in mind that not every PDF file can be read with this library. PDFs that are too blurry, have a special encoding, encrypted, or maybe just created with a particular program that doesn't work well with pypdf won't be able to be read. If you find yourself in this situation, try using the libraries linked above, but keep in mind, these may also not work. The reason for this is because of the many different parameters for a PDF and how non-standard the settings can be, text could be shown as an image instead of a utf-8 encoding. There are many parameters to consider in this aspect.\n",
    "\n",
    "As far as pypdf is concerned, it can only read the text from a PDF document, it won't be able to grab images or other media files from a PDF.\n",
    "___\n",
    "\n",
    "## Working with pypdf\n",
    "\n",
    "Let's being showing the basics of the pypdf library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install pypdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# note the capitalization\n",
    "import pypdf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading PDFs\n",
    "\n",
    "Similar to the csv library, we open a pdf, then create a reader object for it. Notice how we use the binary method of reading , 'rb', instead of just 'r'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Notice we read it as a binary with 'rb'\n",
    "f = open('Working_Business_Proposal.pdf','rb')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "pdf_reader = pypdf.PdfReader(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(pdf_reader.pages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "page_number = 0\n",
    "page_one = pdf_reader.pages[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can then extract the text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "page_one_text = page_one.extract_text()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Business Proposal  The Revolution is Coming \\nLeverage agile frameworks to provide a robust synopsis for high level \\noverviews. Iterative approaches to corporate strategy foster collaborative \\nthinking to further the overall value proposition. Organically grow the \\nholistic world view of disruptive innovation via workplace diversity and \\nempowerment. \\nBring to the table win-win survival strategies to ensure proactive \\ndomination. At the end of the day, going forward, a new normal that has \\nevolved from generation X is on the runway heading towards a streamlined \\ncloud solution. User generated content in real-time will have multiple \\ntouchpoints for offshoring. \\nCapitalize on low hanging fruit to identify a ballpark value added activity to \\nbeta test. Override the digital divide with additional clickthroughs from \\nDevOps. Nanotechnology immersion along the information highway will \\nclose the loop on focusing solely on the bottom line. \\nPodcasting operational change management inside of workﬂows to \\nestablish a framework. Taking seamless key performance indicators ofﬂine \\nto maximise the long tail. Keeping your eye on the ball while performing a \\ndeep dive on the start-up mentality to derive convergence on cross-\\nplatform integration. \\nCollaboratively administrate empowered markets via plug-and-play \\nnetworks. Dynamically procrastinate B2C users after installed base \\nbeneﬁts. Dramatically visualize customer directed convergence without \\nrevolutionary ROI. \\nEfﬁciently unleash cross-media information without cross-media value. \\nQuickly maximize timely deliverables for real-time schemas. Dramatically \\nmaintain clicks-and-mortar solutions without functional solutions. \\nBUSINESS PROPOSAL ! 1'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "page_one_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Adding to PDFs\n",
    "\n",
    "We can not write to PDFs using Python because of the differences between the single string type of Python, and the variety of fonts, placements, and other parameters that a PDF could have.\n",
    "\n",
    "What we can do is copy pages and append pages to the end."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "f = open('Working_Business_Proposal.pdf','rb')\n",
    "pdf_reader = pypdf.PdfReader(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "page_number = 0\n",
    "page_one = pdf_reader.pages[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "pdf_writer = pypdf.PdfWriter()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "pdf_writer.add_page(page_one);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "pdf_output = open(\"Some_New_Doc.pdf\",\"wb\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(False, <_io.BufferedWriter name='Some_New_Doc.pdf'>)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pdf_writer.write(pdf_output)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have copied a page and added it to another new document!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Simple Example\n",
    "\n",
    "Let's try to grab all the text from this PDF file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "f = open('Working_Business_Proposal.pdf','rb')\n",
    "\n",
    "# List of every page's text.\n",
    "# The index will correspond to the page number.\n",
    "pdf_text = []\n",
    "\n",
    "pdf_reader = pypdf.PdfReader(f)\n",
    "\n",
    "for p in range(len(pdf_reader.pages)):\n",
    "    \n",
    "    page = pdf_reader.pages[0]\n",
    "    \n",
    "    pdf_text.append(page.extract_text())\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Business Proposal  The Revolution is Coming \\nLeverage agile frameworks to provide a robust synopsis for high level \\noverviews. Iterative approaches to corporate strategy foster collaborative \\nthinking to further the overall value proposition. Organically grow the \\nholistic world view of disruptive innovation via workplace diversity and \\nempowerment. \\nBring to the table win-win survival strategies to ensure proactive \\ndomination. At the end of the day, going forward, a new normal that has \\nevolved from generation X is on the runway heading towards a streamlined \\ncloud solution. User generated content in real-time will have multiple \\ntouchpoints for offshoring. \\nCapitalize on low hanging fruit to identify a ballpark value added activity to \\nbeta test. Override the digital divide with additional clickthroughs from \\nDevOps. Nanotechnology immersion along the information highway will \\nclose the loop on focusing solely on the bottom line. \\nPodcasting operational change management inside of workﬂows to \\nestablish a framework. Taking seamless key performance indicators ofﬂine \\nto maximise the long tail. Keeping your eye on the ball while performing a \\ndeep dive on the start-up mentality to derive convergence on cross-\\nplatform integration. \\nCollaboratively administrate empowered markets via plug-and-play \\nnetworks. Dynamically procrastinate B2C users after installed base \\nbeneﬁts. Dramatically visualize customer directed convergence without \\nrevolutionary ROI. \\nEfﬁciently unleash cross-media information without cross-media value. \\nQuickly maximize timely deliverables for real-time schemas. Dramatically \\nmaintain clicks-and-mortar solutions without functional solutions. \\nBUSINESS PROPOSAL ! 1',\n",
       " 'Business Proposal  The Revolution is Coming \\nLeverage agile frameworks to provide a robust synopsis for high level \\noverviews. Iterative approaches to corporate strategy foster collaborative \\nthinking to further the overall value proposition. Organically grow the \\nholistic world view of disruptive innovation via workplace diversity and \\nempowerment. \\nBring to the table win-win survival strategies to ensure proactive \\ndomination. At the end of the day, going forward, a new normal that has \\nevolved from generation X is on the runway heading towards a streamlined \\ncloud solution. User generated content in real-time will have multiple \\ntouchpoints for offshoring. \\nCapitalize on low hanging fruit to identify a ballpark value added activity to \\nbeta test. Override the digital divide with additional clickthroughs from \\nDevOps. Nanotechnology immersion along the information highway will \\nclose the loop on focusing solely on the bottom line. \\nPodcasting operational change management inside of workﬂows to \\nestablish a framework. Taking seamless key performance indicators ofﬂine \\nto maximise the long tail. Keeping your eye on the ball while performing a \\ndeep dive on the start-up mentality to derive convergence on cross-\\nplatform integration. \\nCollaboratively administrate empowered markets via plug-and-play \\nnetworks. Dynamically procrastinate B2C users after installed base \\nbeneﬁts. Dramatically visualize customer directed convergence without \\nrevolutionary ROI. \\nEfﬁciently unleash cross-media information without cross-media value. \\nQuickly maximize timely deliverables for real-time schemas. Dramatically \\nmaintain clicks-and-mortar solutions without functional solutions. \\nBUSINESS PROPOSAL ! 1',\n",
       " 'Business Proposal  The Revolution is Coming \\nLeverage agile frameworks to provide a robust synopsis for high level \\noverviews. Iterative approaches to corporate strategy foster collaborative \\nthinking to further the overall value proposition. Organically grow the \\nholistic world view of disruptive innovation via workplace diversity and \\nempowerment. \\nBring to the table win-win survival strategies to ensure proactive \\ndomination. At the end of the day, going forward, a new normal that has \\nevolved from generation X is on the runway heading towards a streamlined \\ncloud solution. User generated content in real-time will have multiple \\ntouchpoints for offshoring. \\nCapitalize on low hanging fruit to identify a ballpark value added activity to \\nbeta test. Override the digital divide with additional clickthroughs from \\nDevOps. Nanotechnology immersion along the information highway will \\nclose the loop on focusing solely on the bottom line. \\nPodcasting operational change management inside of workﬂows to \\nestablish a framework. Taking seamless key performance indicators ofﬂine \\nto maximise the long tail. Keeping your eye on the ball while performing a \\ndeep dive on the start-up mentality to derive convergence on cross-\\nplatform integration. \\nCollaboratively administrate empowered markets via plug-and-play \\nnetworks. Dynamically procrastinate B2C users after installed base \\nbeneﬁts. Dramatically visualize customer directed convergence without \\nrevolutionary ROI. \\nEfﬁciently unleash cross-media information without cross-media value. \\nQuickly maximize timely deliverables for real-time schemas. Dramatically \\nmaintain clicks-and-mortar solutions without functional solutions. \\nBUSINESS PROPOSAL ! 1',\n",
       " 'Business Proposal  The Revolution is Coming \\nLeverage agile frameworks to provide a robust synopsis for high level \\noverviews. Iterative approaches to corporate strategy foster collaborative \\nthinking to further the overall value proposition. Organically grow the \\nholistic world view of disruptive innovation via workplace diversity and \\nempowerment. \\nBring to the table win-win survival strategies to ensure proactive \\ndomination. At the end of the day, going forward, a new normal that has \\nevolved from generation X is on the runway heading towards a streamlined \\ncloud solution. User generated content in real-time will have multiple \\ntouchpoints for offshoring. \\nCapitalize on low hanging fruit to identify a ballpark value added activity to \\nbeta test. Override the digital divide with additional clickthroughs from \\nDevOps. Nanotechnology immersion along the information highway will \\nclose the loop on focusing solely on the bottom line. \\nPodcasting operational change management inside of workﬂows to \\nestablish a framework. Taking seamless key performance indicators ofﬂine \\nto maximise the long tail. Keeping your eye on the ball while performing a \\ndeep dive on the start-up mentality to derive convergence on cross-\\nplatform integration. \\nCollaboratively administrate empowered markets via plug-and-play \\nnetworks. Dynamically procrastinate B2C users after installed base \\nbeneﬁts. Dramatically visualize customer directed convergence without \\nrevolutionary ROI. \\nEfﬁciently unleash cross-media information without cross-media value. \\nQuickly maximize timely deliverables for real-time schemas. Dramatically \\nmaintain clicks-and-mortar solutions without functional solutions. \\nBUSINESS PROPOSAL ! 1',\n",
       " 'Business Proposal  The Revolution is Coming \\nLeverage agile frameworks to provide a robust synopsis for high level \\noverviews. Iterative approaches to corporate strategy foster collaborative \\nthinking to further the overall value proposition. Organically grow the \\nholistic world view of disruptive innovation via workplace diversity and \\nempowerment. \\nBring to the table win-win survival strategies to ensure proactive \\ndomination. At the end of the day, going forward, a new normal that has \\nevolved from generation X is on the runway heading towards a streamlined \\ncloud solution. User generated content in real-time will have multiple \\ntouchpoints for offshoring. \\nCapitalize on low hanging fruit to identify a ballpark value added activity to \\nbeta test. Override the digital divide with additional clickthroughs from \\nDevOps. Nanotechnology immersion along the information highway will \\nclose the loop on focusing solely on the bottom line. \\nPodcasting operational change management inside of workﬂows to \\nestablish a framework. Taking seamless key performance indicators ofﬂine \\nto maximise the long tail. Keeping your eye on the ball while performing a \\ndeep dive on the start-up mentality to derive convergence on cross-\\nplatform integration. \\nCollaboratively administrate empowered markets via plug-and-play \\nnetworks. Dynamically procrastinate B2C users after installed base \\nbeneﬁts. Dramatically visualize customer directed convergence without \\nrevolutionary ROI. \\nEfﬁciently unleash cross-media information without cross-media value. \\nQuickly maximize timely deliverables for real-time schemas. Dramatically \\nmaintain clicks-and-mortar solutions without functional solutions. \\nBUSINESS PROPOSAL ! 1']"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pdf_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Business Proposal  The Revolution is Coming \n",
      "Leverage agile frameworks to provide a robust synopsis for high level \n",
      "overviews. Iterative approaches to corporate strategy foster collaborative \n",
      "thinking to further the overall value proposition. Organically grow the \n",
      "holistic world view of disruptive innovation via workplace diversity and \n",
      "empowerment. \n",
      "Bring to the table win-win survival strategies to ensure proactive \n",
      "domination. At the end of the day, going forward, a new normal that has \n",
      "evolved from generation X is on the runway heading towards a streamlined \n",
      "cloud solution. User generated content in real-time will have multiple \n",
      "touchpoints for offshoring. \n",
      "Capitalize on low hanging fruit to identify a ballpark value added activity to \n",
      "beta test. Override the digital divide with additional clickthroughs from \n",
      "DevOps. Nanotechnology immersion along the information highway will \n",
      "close the loop on focusing solely on the bottom line. \n",
      "Podcasting operational change management inside of workﬂows to \n",
      "establish a framework. Taking seamless key performance indicators ofﬂine \n",
      "to maximise the long tail. Keeping your eye on the ball while performing a \n",
      "deep dive on the start-up mentality to derive convergence on cross-\n",
      "platform integration. \n",
      "Collaboratively administrate empowered markets via plug-and-play \n",
      "networks. Dynamically procrastinate B2C users after installed base \n",
      "beneﬁts. Dramatically visualize customer directed convergence without \n",
      "revolutionary ROI. \n",
      "Efﬁciently unleash cross-media information without cross-media value. \n",
      "Quickly maximize timely deliverables for real-time schemas. Dramatically \n",
      "maintain clicks-and-mortar solutions without functional solutions. \n",
      "BUSINESS PROPOSAL ! 1\n"
     ]
    }
   ],
   "source": [
    "print(pdf_text[3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Excellent work! That is all for pypdf for now, remember that this won't work with every PDF file and is limited in its scope to only text of PDFs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:base] *",
   "language": "python",
   "name": "conda-base-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
